From: "World Chess Championship", INTERNET:newsletter@mark-weeks.com
Date: 00/10/16, 09:31
Re: Chess History on the Web (2000 no.20)

Site review - UPITT (I)

Edward Winter's column 'Chess Lore' ran at the Chess Cafe from 1997 to 1999. While browsing through my archived copies, I found the column 'Wanted' from May 1999, where Winter offered his wish list for chess history research and wrote, 'On the subject of present-day masters, Kasparov seems, remarkably enough, to be a less popular subject for books than he was in the 1980s. [...] A compilation of all his games, whether annotated or not, must surely appear sooner or later, and it is curious that no writer has yet undertaken this task.' This triggered the question, 'How well does UPITT address this need?'

Most chess historians know by now that the University of Pittsburgh (UPITT) archive, at address...

   http://www.pitt.edu/~schach/

...is the single largest collection of chess game scores to be found on the Web. UPITT has collections of game scores in several popular digital formats, including the ubiquitous Portable Game Notation (PGN) format. The collections are categorized by player, event, & opening system. I worked extensively with the files a few years ago and discovered that, although they are far from perfect, they are a valuable source of digital game scores.

I've been wanting to review the UPITT archive for some time, so I decided to take a close look at the Kasparov game collection. It seemed doubly appropriate since Kasparov is defending his world champion title for the first time in five years. After five rounds, the score is 3-2 in favor of Kramnik, although most observers expect Kasparov to win the 16-game match.

I downloaded the Kasparov PGN collection from UPITT. The current version of the file is KASP3-PG.ZIP, which contains two files:-

   - KASPAROV.PGN
   - FILE_ID.DIZ

The DIZ file informs us that KASP3-PG covers 'Garry Kasparov, 1973-1998: 1878 Games. Updated & expanded! Replaces KASP2-xx.ZIP.'

I extracted the seven basic PGN headers and built a database to analyze the content of the file, which confirmed that the UPITT file has 1878 games. The last event on file is Linares 1998.

I then separated the PGN file by year & event, so that all games played in the same event were isolated into a single PGN file. This made it easier to keep track of individual games and to manipulate the PGN text. I found many games with only minimal information on the circumstances surrounding the event -- only a date and 'USSR', for example.

Working with my database and small files, I removed duplicate games. These are not duplicates in the chess sense. Chess database software usually provides a function to identify and remove duplicate game scores automatically; these are games in which all the moves of both games are identical. Unfortunately, this useful function misses many games which are duplicates in the historical sense. The moves of two or more scores may have significant differences, but they cover the same game. I found, for example, two versions of game 1 from Kasparov's 1984 title match against Karpov.

Duplicate game scores arise from several factors. Many of them are caused by almost inconsequential differences in move order -- 23.Nxe4 is immediately followed by 24.Nxe4, where one version has a knight on g5 capturing first, while another version has the other knight on c5 capturing first.

Many of these are transpositions in the opening, where one version opens with 1.e4, a second opens with 1.d4, and seven moves later both versions converge into the same variation. These opening variations are not inconsequential, since grandmasters routinely use transpositions to steer a game into favorite lines while avoiding the opponent's favorite lines. This makes the exact opening variation important in interpreting the historical record of the event.

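As an illustration only, the header extraction and the idea of a 'historical duplicate' described above might be sketched in Python roughly as follows. This is not the database used for the review. The function names read_headers and candidate_duplicates are hypothetical, and grouping on the Date, White, & Black headers is an assumed heuristic. It only flags candidates for manual checking, since inconsistent name spellings will hide some duplicates and two games between the same players on the same day will be flagged wrongly.

```python
import re
from collections import defaultdict

# The seven basic PGN headers (the "Seven Tag Roster" of the PGN standard).
SEVEN_TAGS = ("Event", "Site", "Date", "Round", "White", "Black", "Result")

# A PGN tag pair on a line of its own, e.g. [White "Kasparov, Garry"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def read_headers(pgn_text):
    """Return one dict of seven-tag headers per game found in pgn_text."""
    games = []
    current = None
    in_moves = False  # True once movetext of the current game has been seen
    for line in pgn_text.splitlines():
        stripped = line.strip()
        match = TAG_RE.match(stripped)
        if match:
            # A tag line after movetext (or at the very start) opens a new game.
            if current is None or in_moves:
                current = {}
                games.append(current)
                in_moves = False
            name, value = match.groups()
            if name in SEVEN_TAGS:
                current[name] = value
        elif stripped and current is not None:
            in_moves = True
    return games


def candidate_duplicates(games):
    """Group game indexes sharing Date, White, & Black (assumed heuristic).

    The moves are deliberately ignored, because two scores of the same
    historical game may differ in move order or length.
    """
    groups = defaultdict(list)
    for index, headers in enumerate(games):
        key = (headers.get("Date"), headers.get("White"), headers.get("Black"))
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

Each group returned by candidate_duplicates still has to be inspected by hand to decide whether the scores really cover the same game and which version is correct.
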
I also found duplicate games where one version contains an outright blunder; one player leaves a queen en prise, while the other player fails to capture it. These are almost always due to data entry errors, although I found one event with so many duplicates containing blunders that I suspected someone had planted the bad versions maliciously -- a PGN virus!

The last common category of duplicates consisted of endgames where one version has additional moves to show why a player resigned or why a draw was agreed. I did my best to identify the correct version of each duplicate, but this is not a task to be accomplished in the time available for this review.

I found 348 duplicates in all, leaving 1530 unique games on file. There are undoubtedly a few more duplicates in this collection, but I am confident that they are very few. The duplicate scores account for 18.5% of the games on the original UPITT file, which renders that file almost useless for generating meaningful statistics.

The percentage of duplicates drops from 28.3% over the period 1973 through 1984 to 2.0% over 1995 to 1998. Since most of the older game scores were entered manually, this shows that digital game scores generated by the organizers of an event or by professional chess journalists like Mark Crowther are having a favorable impact on the historical record.

After removing the duplicates, I tried to organize the events chronologically by checking the digital game scores against a written record. The result of this was a set of files using the naming convention 'YYM-xxxx', where:-

   - YY = year
   - M = month, e.g. A = January ... L = December (M = month unknown, X = no written record found)
   - xxxx = shorthand for the venue of the event, e.g. 'LOND' for London.

This means that the 48 games from Kasparov's first title match with Karpov are in a file named 84I-MOSC.PGN, which indicates that the match started in September 1984 and was played in Moscow.

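The 'YYM-xxxx' convention can be illustrated with a minimal Python sketch. The function name event_file_name and its parameters are hypothetical, and the venue shorthand is passed in by the caller, because the text does not give a rule for deriving it.

```python
import string

# A = January ... L = December, as in the naming convention above.
MONTH_LETTERS = string.ascii_uppercase[:12]


def event_file_name(year, month, venue_code, has_written_record=True):
    """Build a 'YYM-xxxx.PGN' file name.

    year               -- four-digit year in which the event started
    month              -- 1..12, or None when the month is unknown
    venue_code         -- caller-supplied shorthand, e.g. 'LOND'
    has_written_record -- False when no written record was found
    """
    if not has_written_record:
        letter = "X"
    elif month is None:
        letter = "M"
    elif 1 <= month <= 12:
        letter = MONTH_LETTERS[month - 1]
    else:
        raise ValueError("month must be 1..12 or None")
    return f"{year % 100:02d}{letter}-{venue_code.upper()}.PGN"


# The first title match with Karpov started in September 1984 in Moscow.
print(event_file_name(1984, 9, "MOSC"))  # 84I-MOSC.PGN
```

Because the month letters run alphabetically, an ordinary sort of the file names puts the dated events of a year in chronological order, with the 'M' and 'X' files at the end of that year.
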
This naming convention has the advantage that file names sort in chronological order. I had to adopt a few more conventions. The year of the event is the year in which it started, so all of the games from the first match with Karpov, which started in 1984 and ended in 1985, are in a file bearing the year 1984. Where an event was played in more than one venue, I similarly used the name of the first, so the games from K-K III, played in London & Leningrad, are in 86G-LOND.PGN. Round numbers in team events correspond to the rounds the team played, which are often more rounds than Kasparov played.

Finally, I compressed the PGN files into a single ZIP file and created a Web page as an index to the events. The index page is at address...

   http://members.tripod.com/~Mark_Weeks/chw00j15/kasparov.htm

...from which the ZIP file is linked. This record is only preliminary and has many holes, but it is historically more accurate than the original UPITT file.

It should be clear that neither the UPITT file nor my new file is the compilation of Kasparov's games that Winter had in mind. Much work remains to be done. Many of the events, especially the exhibitions, are not identified properly. The spelling of names, events, & sites is not consistent across the game scores or across the files. Many of the events lack round numbers for the games. I discovered that one major tournament in 1996 was missing completely, while the last two and a half years of Kasparov's career need to be added. Exhibition games should be segregated and all games should indicate the time control, especially since so many events played in the 1990s were active chess events.

Perhaps most importantly, the PGN game scores have not been checked against a second source. This could easily be the most time-consuming task in constructing a game compilation from a UPITT collection. It also raises new issues. How do I decide the real score of a game, given that there are differences between traditional printed sources?
How do I indicate that a file has been checked against another source while providing the name of the source? I hope to answer these questions as I continue to work on my compilation of Kasparov's games.

Bye for now,
Mark Weeks